{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# UniProt data pre-processing for binding site prediction downstream task"
      ],
      "metadata": {
        "id": "T0vNNHzXM9vJ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "This notebook guides you through:\n",
        "\n",
        "* 📥 **Downloading Data**: Retrieve information from the UniProt website, including details on protein families, binding sites, active sites, and amino acid sequences.\n",
        "* 🛠️ **Processing Data**: Handle special symbols (angle brackets and question marks) in binding/active site information and convert this data into binary labels. Each amino acid position in the protein sequences is marked as 1 (binding/active site) or 0 (non-binding/active site).\n",
        "* ✂️ **Splitting Data**: Divide amino acid sequences and their labels into stratified train/test sets based on UniProt protein families.\n",
        "* 🔄 **Chunking Sequences**: Split sequences and their labels into non-overlapping chunks of a specified length to define a context window for the ESM-2 model.\n",
        "\n",
        "This tutorial is made to run without any GPU support, and can be used in Google colab. If you'd like to open this notebook in colab, you can use the following link.\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/UniProt_Data_Preprocessing_for_Binding_Sites.ipynb)"
      ],
      "metadata": {
        "id": "zp7or6X5SoxH"
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ELKeqlhIQx3D"
      },
      "source": [
        "## Download from UniProt"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xHBLk9EDQx3E"
      },
      "source": [
        "Let's first download a dataset of proteins from UniProt. We will obtain a TSV (Tab-Separated Values) file with specific columns such as Protein families, Binding site, Active site, and Sequence. You can achieve this following these steps:\n",
        "\n",
        "- Go to the [UniProt website](https://www.uniprot.org/) and perform a search to query for the proteins of interest (you can search by organism, protein name, function, etc). Filter your results with the filters on the left-hand side to refine your results further if necessary. Here I performed the search: (organism_id:9606) AND (family:kinase) AND (existence:1 OR existence:2) in UniProtKB.\n",
        "\n",
        "- Select columns: Above the search results, there is an option to select the columns you want to be included in your download. Click on the 'Columns' button and a dropdown menu will appear.\n",
        "\n",
        "- Customize columns: In the dropdown menu, you can check the boxes next to the columns you want to include in your TSV file. Look for the 'Protein families', 'Binding site', 'Active site', and 'Sequence' options. I also added further info such as entry name, protein name, gene name, organism, sequence length and whether the entry has been reviewed.\n",
        "\n",
        "- Download the file: After selecting the desired columns, click the 'Download' button located above the search results. Choose the 'Tab-separated' format from the list of available formats. You may also have the option to select the number of entries you want to download (e.g., all entries, displayed entries, or a custom range).\n",
        "Click on the 'Download' button to start the download process and your browser will prompt you to save the TSV file."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FnsNzJ09Qx3E"
      },
      "source": [
        "## Process data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_XfxBxLdQx3E"
      },
      "source": [
        "Now, let's process the downloaded UniProt TSV file with columns (Protein families, Binding site, Active site, Sequence). If the family annotation or binding sites are missing, the code will filter out this sequence. If the Active site annotation is missing, the sequence will be included without issue. Missing sequences are not handled by this notebook."
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "But first, let's set up the environment:"
      ],
      "metadata": {
        "id": "5YS5sXh1RwZ0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install pandas\n",
        "!pip install numpy\n",
        "!pip install requests"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "v9we_r6KfWXK",
        "outputId": "83d3fd84-e64d-4015-8fae-007f5f2287c7"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (2.0.3)\n",
            "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.4)\n",
            "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2024.1)\n",
            "Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas) (1.25.2)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.25.2)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.31.0)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.3.2)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests) (3.7)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests) (2.0.7)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests) (2024.7.4)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "egKHZwB4Qx3E"
      },
      "outputs": [],
      "source": [
        "# I/O\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "import re\n",
        "import random\n",
        "import pickle\n",
        "import os\n",
        "import requests\n",
        "import xml.etree.ElementTree as ET\n",
        "# set seed\n",
        "random.seed(42)\n",
        "np.random.seed(42)"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "If you upload the downloaded file from UniProt to Google Drive, you should be able to access it by first mounting your Google Drive and then loading it:"
      ],
      "metadata": {
        "id": "oBAmD-gsURg3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from google.colab import drive\n",
        "drive.mount('/content/gdrive')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "collapsed": true,
        "id": "TQTPaxM3Rul7",
        "outputId": "1a351a18-4a01-47ae-e54a-170836124f9b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Mounted at /content/gdrive\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ikkAcUQDQx3E",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 397
        },
        "outputId": "f08b84fa-848e-482a-efa0-61614fbf3ae4"
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "        Entry    Reviewed        Entry Name  \\\n",
              "0  A0A087WV00  unreviewed  A0A087WV00_HUMAN   \n",
              "1  A0A090N7W4  unreviewed  A0A090N7W4_HUMAN   \n",
              "2  A0A0S2Z310  unreviewed  A0A0S2Z310_HUMAN   \n",
              "3  A0A0S2Z4D1  unreviewed  A0A0S2Z4D1_HUMAN   \n",
              "4  A0A2P9DU05  unreviewed  A0A2P9DU05_HUMAN   \n",
              "\n",
              "                                       Protein names  \\\n",
              "0  Diacylglycerol kinase (DAG kinase) (EC 2.7.1.107)   \n",
              "1                     Cell division protein kinase 5   \n",
              "2  Serine/threonine-protein kinase receptor (EC 2...   \n",
              "3  non-specific serine/threonine protein kinase (...   \n",
              "4        Rho-associated protein kinase (EC 2.7.11.1)   \n",
              "\n",
              "                 Gene Names              Organism  \\\n",
              "0                      DGKI  Homo sapiens (Human)   \n",
              "1  CDK5 hCG_18690 tcag7.772  Homo sapiens (Human)   \n",
              "2                    ACVRL1  Homo sapiens (Human)   \n",
              "3                     STK11  Homo sapiens (Human)   \n",
              "4                     ROCK2  Homo sapiens (Human)   \n",
              "\n",
              "                                    Protein families  \\\n",
              "0            Eukaryotic diacylglycerol kinase family   \n",
              "1  Protein kinase superfamily, CMGC Ser/Thr prote...   \n",
              "2  Protein kinase superfamily, TKL Ser/Thr protei...   \n",
              "3  Protein kinase superfamily, CAMK Ser/Thr prote...   \n",
              "4  Protein kinase superfamily, AGC Ser/Thr protei...   \n",
              "\n",
              "                                            Sequence  Length  \\\n",
              "0  MDAAGRGCHLLPLPAARGPARAPAAAAAAAASPPGPCSGAACAPSA...    1057   \n",
              "1  MQKYEKLEKIGEGTYGTVFKAKNRETHEIVALKRVRLDDDDEGVPS...     292   \n",
              "2  MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTC...     503   \n",
              "3  MEVVDPQQLGMFTEGELMSVGMDTFIHRIDSTEVIYQPRRKRAKLI...     433   \n",
              "4  MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...    1388   \n",
              "\n",
              "                                        Binding site  \\\n",
              "0                                                NaN   \n",
              "1  BINDING 33; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...   \n",
              "2  BINDING 229; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...   \n",
              "3  BINDING 78; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...   \n",
              "4  BINDING 121; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...   \n",
              "\n",
              "                                         Active site  \n",
              "0                                                NaN  \n",
              "1                                                NaN  \n",
              "2                                                NaN  \n",
              "3                                                NaN  \n",
              "4  ACT_SITE 214; /note=\"Proton acceptor\"; /eviden...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-b826ec50-25f1-4bf4-9997-7db39bd58f37\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Entry</th>\n",
              "      <th>Reviewed</th>\n",
              "      <th>Entry Name</th>\n",
              "      <th>Protein names</th>\n",
              "      <th>Gene Names</th>\n",
              "      <th>Organism</th>\n",
              "      <th>Protein families</th>\n",
              "      <th>Sequence</th>\n",
              "      <th>Length</th>\n",
              "      <th>Binding site</th>\n",
              "      <th>Active site</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>A0A087WV00</td>\n",
              "      <td>unreviewed</td>\n",
              "      <td>A0A087WV00_HUMAN</td>\n",
              "      <td>Diacylglycerol kinase (DAG kinase) (EC 2.7.1.107)</td>\n",
              "      <td>DGKI</td>\n",
              "      <td>Homo sapiens (Human)</td>\n",
              "      <td>Eukaryotic diacylglycerol kinase family</td>\n",
              "      <td>MDAAGRGCHLLPLPAARGPARAPAAAAAAAASPPGPCSGAACAPSA...</td>\n",
              "      <td>1057</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>A0A090N7W4</td>\n",
              "      <td>unreviewed</td>\n",
              "      <td>A0A090N7W4_HUMAN</td>\n",
              "      <td>Cell division protein kinase 5</td>\n",
              "      <td>CDK5 hCG_18690 tcag7.772</td>\n",
              "      <td>Homo sapiens (Human)</td>\n",
              "      <td>Protein kinase superfamily, CMGC Ser/Thr prote...</td>\n",
              "      <td>MQKYEKLEKIGEGTYGTVFKAKNRETHEIVALKRVRLDDDDEGVPS...</td>\n",
              "      <td>292</td>\n",
              "      <td>BINDING 33; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>A0A0S2Z310</td>\n",
              "      <td>unreviewed</td>\n",
              "      <td>A0A0S2Z310_HUMAN</td>\n",
              "      <td>Serine/threonine-protein kinase receptor (EC 2...</td>\n",
              "      <td>ACVRL1</td>\n",
              "      <td>Homo sapiens (Human)</td>\n",
              "      <td>Protein kinase superfamily, TKL Ser/Thr protei...</td>\n",
              "      <td>MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTC...</td>\n",
              "      <td>503</td>\n",
              "      <td>BINDING 229; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>A0A0S2Z4D1</td>\n",
              "      <td>unreviewed</td>\n",
              "      <td>A0A0S2Z4D1_HUMAN</td>\n",
              "      <td>non-specific serine/threonine protein kinase (...</td>\n",
              "      <td>STK11</td>\n",
              "      <td>Homo sapiens (Human)</td>\n",
              "      <td>Protein kinase superfamily, CAMK Ser/Thr prote...</td>\n",
              "      <td>MEVVDPQQLGMFTEGELMSVGMDTFIHRIDSTEVIYQPRRKRAKLI...</td>\n",
              "      <td>433</td>\n",
              "      <td>BINDING 78; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>A0A2P9DU05</td>\n",
              "      <td>unreviewed</td>\n",
              "      <td>A0A2P9DU05_HUMAN</td>\n",
              "      <td>Rho-associated protein kinase (EC 2.7.11.1)</td>\n",
              "      <td>ROCK2</td>\n",
              "      <td>Homo sapiens (Human)</td>\n",
              "      <td>Protein kinase superfamily, AGC Ser/Thr protei...</td>\n",
              "      <td>MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...</td>\n",
              "      <td>1388</td>\n",
              "      <td>BINDING 121; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...</td>\n",
              "      <td>ACT_SITE 214; /note=\"Proton acceptor\"; /eviden...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b826ec50-25f1-4bf4-9997-7db39bd58f37')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-b826ec50-25f1-4bf4-9997-7db39bd58f37 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-b826ec50-25f1-4bf4-9997-7db39bd58f37');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-e90966b3-2755-44ca-b283-e1084844189c\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-e90966b3-2755-44ca-b283-e1084844189c')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-e90966b3-2755-44ca-b283-e1084844189c button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ]
          },
          "metadata": {},
          "execution_count": 4
        }
      ],
      "source": [
        "# Load the dataset\n",
        "file_path = \"/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29.tsv\"\n",
        "data = pd.read_csv(file_path, sep='\\t')\n",
        "data.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AkwTOCn0Qx3F"
      },
      "source": [
        "Now let's extract the required information for the purposes of this task: Protein families, Binding site, Active site, Sequence. Also, let's filter out entries without binding site or protein families information."
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "data[\"Binding site\"]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "54cZYSkRsgDz",
        "outputId": "d4bc7c8d-ad84-4022-8407-7b1a67007ef6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0                                                     NaN\n",
              "1       BINDING 33; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...\n",
              "2       BINDING 229; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...\n",
              "3       BINDING 78; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...\n",
              "4       BINDING 121; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...\n",
              "                              ...                        \n",
              "2186                                                  NaN\n",
              "2187                                                  NaN\n",
              "2188                                                  NaN\n",
              "2189    BINDING 73; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...\n",
              "2190    BINDING 165; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...\n",
              "Name: Binding site, Length: 2191, dtype: object"
            ]
          },
          "metadata": {},
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-iSyJ91CQx3F",
        "outputId": "76d2d640-7a80-4083-af61-e60be00801f2",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 224
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(1406, 5)\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "        Entry                                   Protein families  \\\n",
              "1  A0A090N7W4  Protein kinase superfamily, CMGC Ser/Thr prote...   \n",
              "2  A0A0S2Z310  Protein kinase superfamily, TKL Ser/Thr protei...   \n",
              "3  A0A0S2Z4D1  Protein kinase superfamily, CAMK Ser/Thr prote...   \n",
              "4  A0A2P9DU05  Protein kinase superfamily, AGC Ser/Thr protei...   \n",
              "5      A3QNQ0  Protein kinase superfamily, TKL Ser/Thr protei...   \n",
              "\n",
              "                                        Binding site  \\\n",
              "1  BINDING 33; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...   \n",
              "2  BINDING 229; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...   \n",
              "3  BINDING 78; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...   \n",
              "4  BINDING 121; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...   \n",
              "5  BINDING 250..258; /ligand=\"ATP\"; /ligand_id=\"C...   \n",
              "\n",
              "                                         Active site  \\\n",
              "1                                                NaN   \n",
              "2                                                NaN   \n",
              "3                                                NaN   \n",
              "4  ACT_SITE 214; /note=\"Proton acceptor\"; /eviden...   \n",
              "5  ACT_SITE 379; /note=\"Proton acceptor\"; /eviden...   \n",
              "\n",
              "                                            Sequence  \n",
              "1  MQKYEKLEKIGEGTYGTVFKAKNRETHEIVALKRVRLDDDDEGVPS...  \n",
              "2  MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTC...  \n",
              "3  MEVVDPQQLGMFTEGELMSVGMDTFIHRIDSTEVIYQPRRKRAKLI...  \n",
              "4  MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...  \n",
              "5  MGRGLLRGLWPLHIVLWTRIASTIPPHVQKSVNNDMIVTDNNGAVK...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-fa828e7c-1be0-4bc3-9078-3f27591dd30a\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Entry</th>\n",
              "      <th>Protein families</th>\n",
              "      <th>Binding site</th>\n",
              "      <th>Active site</th>\n",
              "      <th>Sequence</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>A0A090N7W4</td>\n",
              "      <td>Protein kinase superfamily, CMGC Ser/Thr prote...</td>\n",
              "      <td>BINDING 33; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MQKYEKLEKIGEGTYGTVFKAKNRETHEIVALKRVRLDDDDEGVPS...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>A0A0S2Z310</td>\n",
              "      <td>Protein kinase superfamily, TKL Ser/Thr protei...</td>\n",
              "      <td>BINDING 229; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTC...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>A0A0S2Z4D1</td>\n",
              "      <td>Protein kinase superfamily, CAMK Ser/Thr prote...</td>\n",
              "      <td>BINDING 78; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MEVVDPQQLGMFTEGELMSVGMDTFIHRIDSTEVIYQPRRKRAKLI...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>A0A2P9DU05</td>\n",
              "      <td>Protein kinase superfamily, AGC Ser/Thr protei...</td>\n",
              "      <td>BINDING 121; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...</td>\n",
              "      <td>ACT_SITE 214; /note=\"Proton acceptor\"; /eviden...</td>\n",
              "      <td>MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>A3QNQ0</td>\n",
              "      <td>Protein kinase superfamily, TKL Ser/Thr protei...</td>\n",
              "      <td>BINDING 250..258; /ligand=\"ATP\"; /ligand_id=\"C...</td>\n",
              "      <td>ACT_SITE 379; /note=\"Proton acceptor\"; /eviden...</td>\n",
              "      <td>MGRGLLRGLWPLHIVLWTRIASTIPPHVQKSVNNDMIVTDNNGAVK...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-fa828e7c-1be0-4bc3-9078-3f27591dd30a')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-fa828e7c-1be0-4bc3-9078-3f27591dd30a button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-fa828e7c-1be0-4bc3-9078-3f27591dd30a');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-0e71b5c9-ce1a-4e47-84ce-722c692f6871\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-0e71b5c9-ce1a-4e47-84ce-722c692f6871')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-0e71b5c9-ce1a-4e47-84ce-722c692f6871 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "data",
              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 1406,\n  \"fields\": [\n    {\n      \"column\": \"Entry\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1406,\n        \"samples\": [\n          \"B2RE75\",\n          \"A0A8V8TPW3\",\n          \"A0A5P8NAS4\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Protein families\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 126,\n        \"samples\": [\n          \"Protein kinase superfamily, CMGC Ser/Thr protein kinase family, GSK-3 subfamily\",\n          \"Protein kinase superfamily, CAMK Ser/Thr protein kinase family, CaMK subfamily\",\n          \"Type II pantothenate kinase family; Damage-control phosphatase family, Phosphopantetheine phosphatase II subfamily\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Binding site\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 830,\n        \"samples\": [\n          \"BINDING 46; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000256|PROSITE-ProRule:PRU10141\\\"\",\n          \"BINDING 13; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000255|HAMAP-Rule:MF_03173, ECO:0000269|PubMed:22038794, ECO:0000269|PubMed:27477389, ECO:0007744|PDB:3IIJ, ECO:0007744|PDB:3IIL, ECO:0007744|PDB:5JZV\\\"; BINDING 15; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000255|HAMAP-Rule:MF_03173, ECO:0000269|PubMed:22038794, ECO:0000269|PubMed:27477389, ECO:0007744|PDB:3IIJ, ECO:0007744|PDB:3IIL, ECO:0007744|PDB:5JZV\\\"; BINDING 16; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000255|HAMAP-Rule:MF_03173, ECO:0000269|PubMed:22038794, ECO:0000269|PubMed:27477389, 
ECO:0007744|PDB:3IIJ, ECO:0007744|PDB:3IIL, ECO:0007744|PDB:5JZV\\\"; BINDING 17; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000255|HAMAP-Rule:MF_03173, ECO:0000269|PubMed:22038794, ECO:0000269|PubMed:27477389, ECO:0007744|PDB:3IIJ, ECO:0007744|PDB:3IIL, ECO:0007744|PDB:5JZV\\\"; BINDING 18; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000255|HAMAP-Rule:MF_03173, ECO:0000269|PubMed:22038794, ECO:0000269|PubMed:27477389, ECO:0007744|PDB:3IIJ, ECO:0007744|PDB:3IIL, ECO:0007744|PDB:5JZV\\\"; BINDING 109; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000255|HAMAP-Rule:MF_03173, ECO:0000269|PubMed:22038794, ECO:0000269|PubMed:27477389, ECO:0007744|PDB:3IIJ, ECO:0007744|PDB:3IIL, ECO:0007744|PDB:5JZV\\\"; BINDING 148; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000255|HAMAP-Rule:MF_03173, ECO:0000269|PubMed:22038794, ECO:0000269|PubMed:27477389, ECO:0007744|PDB:3IIJ, ECO:0007744|PDB:3IIL, ECO:0007744|PDB:5JZV\\\"\",\n          \"BINDING 569..577; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000255|PROSITE-ProRule:PRU00159\\\"; BINDING 608; /ligand=\\\"ATP\\\"; /ligand_id=\\\"ChEBI:CHEBI:30616\\\"; /evidence=\\\"ECO:0000305\\\"\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Active site\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 475,\n        \"samples\": [\n          \"ACT_SITE 256; /note=\\\"Proton acceptor\\\"; /evidence=\\\"ECO:0000256|PIRSR:PIRSR630616-1\\\"\",\n          \"ACT_SITE 174; /note=\\\"Proton acceptor\\\"; /evidence=\\\"ECO:0000256|PIRSR:PIRSR000605-50\\\"\",\n          \"ACT_SITE 406; /note=\\\"Proton acceptor\\\"; /evidence=\\\"ECO:0000250\\\"\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": 
\"Sequence\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1361,\n        \"samples\": [\n          \"MGHALCVCSRGTVIIDNKRYLFIQKLGEGGFSYVDLVEGLHDGHFYALKRILCHEQQDREEAQREADMHRLFNHPNILRLVAYCLRERGAKHEAWLLLPFFKRGTLWNEIERLKDKGNFLTEDQILWLLLGICRGLEAIHAKGYAHRDLKPTNILLGDEGQPVLMDLGSMNQACIHVEGSRQALTLQDWAAQRCTISYRAPELFSVQSHCVIDERTDVWSLGCVLYAMMFGEGPYDMVFQKGDSVALAVQNQLSIPQSPRHSSALRQLLNSMMTVDPHQRPHIPLLLSQLEALQPPAPGQHTTQI\",\n          \"MSQTSSIGSAESLISLERKKEKNINRDITSRKDLPSRTSNVERKASQQQWGRGNFTEGKVPHIRIENGAAIEEIYTFGRILGKGSFGIVIEATDKETETKWAIKKVNKEKAGSSAVKLLEREVNILKSVKHEHIIHLEQVFETPKKMYLVMELCEDGELKEILDRKGHFSENETRWIIQSLASAIAYLHNNDIVHRDLKLENIMVKSSLIDDNNE\",\n          \"MEKYERIRVVGRGAFGIVHLCLRKADQKLVIIKQIPVEQMTKEERQAAQNECQVLKLLNHPNVIEYYENFLEDKALMTAMEYAPGGTLAEFIQKRCNSLLEEETILHFFVQILLALHHVHTHLILHRDLKTQNILLDKHRMVVKIGDFGISKILSSKSKAYTVVGTPCYISPELCEGKPYNQKSDIWALGCVLYELASLKRAFEAANLPALVLKIMSGTFAPISDRYSPELRQLVLSLLSLEPAQRPPLSHIMAQPLCIRALLNLHTDVGSVRMRRPVQGQRAVLGGRVWAPSGSTGGLRQRETWGKSSLPACRNVRRVFVLRPPSVLQGREVRGPQQHREQDHQCPLQRYPPGTCEASHPTTTVVSVCLGWWAGHPPAAANAQHRGGPGGSWAHAESRRHALWASHPVGGPTPRCRRRQSPSWGSGAATAPVHLAFPGGPVGCDHQARGLWGLLHCLPD\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 6
        }
      ],
      "source": [
        "data = data[[\"Entry\", \"Protein families\", \"Binding site\", \"Active site\", \"Sequence\"]]\n",
        "# Drop rows with NaN in the 'Protein families', 'Binding site', or 'Sequence' column\n",
        "data = data[pd.notna(data['Protein families']) & pd.notna(data['Binding site']) & pd.notna(data['Sequence'])]\n",
        "print(data.shape)\n",
        "data.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dA_aYCzbQx3F"
      },
      "source": [
        "We now have a dataset of 1406 proteins, each with binding site annotations, an amino acid sequence, and a protein family. We downloaded human proteins from the kinase family; however, the data still contains subgroups of protein families:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "K0PHRmLiQx3F",
        "outputId": "e6b1239f-fc32-4e63-ee3a-0480dc3f58d2",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 649
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Protein families\n",
            "Protein kinase superfamily                                                             164\n",
            "Protein kinase superfamily, CMGC Ser/Thr protein kinase family, CDC2/CDKX subfamily     96\n",
            "Protein kinase superfamily, STE Ser/Thr protein kinase family, STE20 subfamily          78\n",
            "Protein kinase superfamily, Tyr protein kinase family, Insulin receptor subfamily       73\n",
            "Protein kinase superfamily, CAMK Ser/Thr protein kinase family                          56\n",
            "                                                                                      ... \n",
            "GHMP kinase family, Mevalonate kinase subfamily                                          1\n",
            "Protein kinase superfamily, TKL Ser/Thr protein kinase family, ROCO subfamily            1\n",
            "Glutamate 5-kinase family; Gamma-glutamyl phosphate reductase family                     1\n",
            "Guanylate kinase family                                                                  1\n",
            "GHMP kinase family                                                                       1\n",
            "Length: 126, dtype: int64\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "           Entry                                   Protein families  \\\n",
              "359       Q504Y2                         Protein kinase superfamily   \n",
              "414       Q8IWB6                         Protein kinase superfamily   \n",
              "427       Q8NB16                         Protein kinase superfamily   \n",
              "778   A0A7P0T838                         Protein kinase superfamily   \n",
              "779   A0A7P0T952                         Protein kinase superfamily   \n",
              "...          ...                                                ...   \n",
              "1770      M1VPF4  Protein kinase superfamily, Tyr protein kinase...   \n",
              "21        O00764                           Pyridoxine kinase family   \n",
              "1017      M1V485  SLC34A transporter family; Protein kinase supe...   \n",
              "82        P04183                            Thymidine kinase family   \n",
              "542       Q9NVE7  Type II pantothenate kinase family; Damage-con...   \n",
              "\n",
              "                                           Binding site  \\\n",
              "359   BINDING 144..152; /ligand=\"ATP\"; /ligand_id=\"C...   \n",
              "414   BINDING 233..241; /ligand=\"ATP\"; /ligand_id=\"C...   \n",
              "427   BINDING 209..217; /ligand=\"ATP\"; /ligand_id=\"C...   \n",
              "778   BINDING 71; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...   \n",
              "779   BINDING 71; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...   \n",
              "...                                                 ...   \n",
              "1770  BINDING 358; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...   \n",
              "21    BINDING 12; /ligand=\"pyridoxal\"; /ligand_id=\"C...   \n",
              "1017  BINDING 906; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...   \n",
              "82    BINDING 26..33; /ligand=\"ATP\"; /ligand_id=\"ChE...   \n",
              "542   BINDING 196; /ligand=\"acetyl-CoA\"; /ligand_id=...   \n",
              "\n",
              "                                            Active site  \\\n",
              "359   ACT_SITE 278; /note=\"Proton acceptor\"; /eviden...   \n",
              "414                                                 NaN   \n",
              "427                                                 NaN   \n",
              "778                                                 NaN   \n",
              "779                                                 NaN   \n",
              "...                                                 ...   \n",
              "1770                                                NaN   \n",
              "21    ACT_SITE 235; /note=\"Proton acceptor\"; /eviden...   \n",
              "1017                                                NaN   \n",
              "82    ACT_SITE 98; /note=\"Proton acceptor\"; /evidenc...   \n",
              "542                                                 NaN   \n",
              "\n",
              "                                               Sequence  \n",
              "359   MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...  \n",
              "414   MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...  \n",
              "427   MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...  \n",
              "778   MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...  \n",
              "779   MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...  \n",
              "...                                                 ...  \n",
              "1770  MMEAIKKKMQMLKLDKENALDRAEQAEAEQKQAEERSKQLEDELAA...  \n",
              "21    MEEECRVLSIQSHVIRGYVGNRAATFPLQVLGFEIDAVNSVQFSNH...  \n",
              "1017  MAPWPELGDAQPNPDKYLEGAAGQQPTAPDKSKETNKTDNTEAPVT...  \n",
              "82    MSCINLPTVLPGSPSKTRGQIQVILGPMFSGKSTELMRRVRRFQIA...  \n",
              "542   MAECGASGSGSSGDSLDKSITLPPDEIFRNLENAKRFAIDIGGSLT...  \n",
              "\n",
              "[1406 rows x 5 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-41104dbd-3b0d-4d7f-a634-c64ed20096f2\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Entry</th>\n",
              "      <th>Protein families</th>\n",
              "      <th>Binding site</th>\n",
              "      <th>Active site</th>\n",
              "      <th>Sequence</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>359</th>\n",
              "      <td>Q504Y2</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>BINDING 144..152; /ligand=\"ATP\"; /ligand_id=\"C...</td>\n",
              "      <td>ACT_SITE 278; /note=\"Proton acceptor\"; /eviden...</td>\n",
              "      <td>MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>414</th>\n",
              "      <td>Q8IWB6</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>BINDING 233..241; /ligand=\"ATP\"; /ligand_id=\"C...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>427</th>\n",
              "      <td>Q8NB16</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>BINDING 209..217; /ligand=\"ATP\"; /ligand_id=\"C...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>778</th>\n",
              "      <td>A0A7P0T838</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>BINDING 71; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>779</th>\n",
              "      <td>A0A7P0T952</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>BINDING 71; /ligand=\"ATP\"; /ligand_id=\"ChEBI:C...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1770</th>\n",
              "      <td>M1VPF4</td>\n",
              "      <td>Protein kinase superfamily, Tyr protein kinase...</td>\n",
              "      <td>BINDING 358; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MMEAIKKKMQMLKLDKENALDRAEQAEAEQKQAEERSKQLEDELAA...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>21</th>\n",
              "      <td>O00764</td>\n",
              "      <td>Pyridoxine kinase family</td>\n",
              "      <td>BINDING 12; /ligand=\"pyridoxal\"; /ligand_id=\"C...</td>\n",
              "      <td>ACT_SITE 235; /note=\"Proton acceptor\"; /eviden...</td>\n",
              "      <td>MEEECRVLSIQSHVIRGYVGNRAATFPLQVLGFEIDAVNSVQFSNH...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1017</th>\n",
              "      <td>M1V485</td>\n",
              "      <td>SLC34A transporter family; Protein kinase supe...</td>\n",
              "      <td>BINDING 906; /ligand=\"ATP\"; /ligand_id=\"ChEBI:...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MAPWPELGDAQPNPDKYLEGAAGQQPTAPDKSKETNKTDNTEAPVT...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>82</th>\n",
              "      <td>P04183</td>\n",
              "      <td>Thymidine kinase family</td>\n",
              "      <td>BINDING 26..33; /ligand=\"ATP\"; /ligand_id=\"ChE...</td>\n",
              "      <td>ACT_SITE 98; /note=\"Proton acceptor\"; /evidenc...</td>\n",
              "      <td>MSCINLPTVLPGSPSKTRGQIQVILGPMFSGKSTELMRRVRRFQIA...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>542</th>\n",
              "      <td>Q9NVE7</td>\n",
              "      <td>Type II pantothenate kinase family; Damage-con...</td>\n",
              "      <td>BINDING 196; /ligand=\"acetyl-CoA\"; /ligand_id=...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>MAECGASGSGSSGDSLDKSITLPPDEIFRNLENAKRFAIDIGGSLT...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>1406 rows × 5 columns</p>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-41104dbd-3b0d-4d7f-a634-c64ed20096f2')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-41104dbd-3b0d-4d7f-a634-c64ed20096f2 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-41104dbd-3b0d-4d7f-a634-c64ed20096f2');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-dbd11a08-a4b5-4509-9d3a-a78b782a21f3\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-dbd11a08-a4b5-4509-9d3a-a78b782a21f3')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-dbd11a08-a4b5-4509-9d3a-a78b782a21f3 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "  <div id=\"id_a1631877-e0fa-4060-8a7d-e6d101de2071\">\n",
              "    <style>\n",
              "      .colab-df-generate {\n",
              "        background-color: #E8F0FE;\n",
              "        border: none;\n",
              "        border-radius: 50%;\n",
              "        cursor: pointer;\n",
              "        display: none;\n",
              "        fill: #1967D2;\n",
              "        height: 32px;\n",
              "        padding: 0 0 0 0;\n",
              "        width: 32px;\n",
              "      }\n",
              "\n",
              "      .colab-df-generate:hover {\n",
              "        background-color: #E2EBFA;\n",
              "        box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "        fill: #174EA6;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate {\n",
              "        background-color: #3B4455;\n",
              "        fill: #D2E3FC;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate:hover {\n",
              "        background-color: #434B5C;\n",
              "        box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "        filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "        fill: #FFFFFF;\n",
              "      }\n",
              "    </style>\n",
              "    <button class=\"colab-df-generate\" onclick=\"generateWithVariable('data')\"\n",
              "            title=\"Generate code using this dataframe.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "    <script>\n",
              "      (() => {\n",
              "      const buttonEl =\n",
              "        document.querySelector('#id_a1631877-e0fa-4060-8a7d-e6d101de2071 button.colab-df-generate');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      buttonEl.onclick = () => {\n",
              "        google.colab.notebook.generateWithVariable('data');\n",
              "      }\n",
              "      })();\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ]
          },
          "metadata": {},
          "execution_count": 7
        }
      ],
      "source": [
        "# Group the data by 'Protein families' and get the size of each group\n",
        "family_sizes = data.groupby('Protein families').size()\n",
        "print(family_sizes.sort_values(ascending=False))\n",
        "\n",
        "# Create a new column with the size of each family and sort by 'Family size' in descending order and then by 'Protein families'\n",
        "data['Family size'] = data['Protein families'].map(family_sizes)\n",
        "data = data.sort_values(by=['Family size', 'Protein families'], ascending=[False, True])\n",
        "data.drop(columns='Family size', inplace=True) # Drop the 'Family size' column as it is no longer needed\n",
        "data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c7JDj8dSQx3F"
      },
      "source": [
        "Now let's make the binding and active sites information clearer:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0FDSLc29Qx3F",
        "outputId": "b23535eb-cfad-4737-c58d-55ed11d858cf",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "          Entry            Protein families   Binding site Active site  \\\n",
              "359      Q504Y2  Protein kinase superfamily  144..152; 166         278   \n",
              "414      Q8IWB6  Protein kinase superfamily  233..241; 273        None   \n",
              "427      Q8NB16  Protein kinase superfamily  209..217; 230        None   \n",
              "778  A0A7P0T838  Protein kinase superfamily             71        None   \n",
              "779  A0A7P0T952  Protein kinase superfamily             71        None   \n",
              "\n",
              "                                              Sequence  \n",
              "359  MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...  \n",
              "414  MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...  \n",
              "427  MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...  \n",
              "778  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...  \n",
              "779  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-117c4736-5ab6-4c7b-a539-1d7a599d2202\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Entry</th>\n",
              "      <th>Protein families</th>\n",
              "      <th>Binding site</th>\n",
              "      <th>Active site</th>\n",
              "      <th>Sequence</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>359</th>\n",
              "      <td>Q504Y2</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>144..152; 166</td>\n",
              "      <td>278</td>\n",
              "      <td>MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>414</th>\n",
              "      <td>Q8IWB6</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>233..241; 273</td>\n",
              "      <td>None</td>\n",
              "      <td>MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>427</th>\n",
              "      <td>Q8NB16</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>209..217; 230</td>\n",
              "      <td>None</td>\n",
              "      <td>MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>778</th>\n",
              "      <td>A0A7P0T838</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>71</td>\n",
              "      <td>None</td>\n",
              "      <td>MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>779</th>\n",
              "      <td>A0A7P0T952</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>71</td>\n",
              "      <td>None</td>\n",
              "      <td>MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-117c4736-5ab6-4c7b-a539-1d7a599d2202')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-117c4736-5ab6-4c7b-a539-1d7a599d2202 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-117c4736-5ab6-4c7b-a539-1d7a599d2202');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "data",
              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 1406,\n  \"fields\": [\n    {\n      \"column\": \"Entry\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1406,\n        \"samples\": [\n          \"D6RBM8\",\n          \"P07333\",\n          \"P80192\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Protein families\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 126,\n        \"samples\": [\n          \"APS kinase family; Sulfate adenylyltransferase family\",\n          \"Protein kinase superfamily, AGC Ser/Thr protein kinase family\",\n          \"Protein kinase superfamily, Ser/Thr protein kinase family, CDC7 subfamily\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Binding site\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 758,\n        \"samples\": [\n          \"85..92; 859\",\n          \"168\",\n          \"208..216; 229\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Active site\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 362,\n        \"samples\": [\n          \"607\",\n          \"147\",\n          \"265\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Sequence\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1361,\n        \"samples\": [\n          
\"MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQGGFGCIYLADMNSSESVGSDAPCVVKVEPSDNGPLFTELKFYQRAAKPEQIQKWIRTRKLKYLGVPKYWGSGLHDKNGKSYRFMIMDRFGSDLQKIYEANAKRFSRKTVLQLSLRILDILEYIHEHEYVHGDIKASNLLLNYKNPDQVYLVDYGLAYRYCPEGVHKEYKEDPKRCHDGTIEFTSIDAHNGVAPSRRGDLEILGYCMIQWLTGHLPWEDNLKDPKYVRDSKIRYRENIASLMDKCFPEKNKPGEIAKYMETVKLLDYTEKPLYENLRDILLQGLKAIGSKDDGKLDLSVVENGGLKAKTITKKRKKEIEESKEPGVEDTEWSNTQTEEAIQTLRSPKEQYIEACLSQRLAAAAMTVQEPERESRSNSDAVNQISFSLFSFDFFLLFYLNCFIFL\",\n          \"MLTRKPSAAAPAAYPTGRGGDSAVRQLQASPGLGAGATRSGVGTGPPSPIALPPLRASNAAAAAHTIGGSKHTMNDHLHVGSHAHGQIQVQQLFEDNSNKRTVLTTQPNGLTTVGKTGLPVVPERQLDSIHRRQGSSTSLKSMEGMGKVKATPMTPEQAMKQYMQKLTAFEHHEIFSYPEIYFLGLNAKKRQGMTGGPNNGGYDDDQGSYVQVPHDHVAYRYEVLKVIGKGSFGQVVKAYDHKVHQHVALKMVRNEKRFHRQAAEEIRILEHLRKQDKDNTMNVIHMLENFTFRNHICMTFELLSMNLYELIKKNKFQGFSLPLVRKFAHSILQCLDALHKNRIIHCDLKPENILLKQQGRSGIKVIDFGSSCYEHQRVYTYIQSRFYRAPEVILGARYGMPIDMWSLGCILAELLTGYPLLPGEDEGDQLACMIELLGMPSQKLLDASKRAKNFVSSKGYPRYCTVTTLSDGSVVLNGGRSRRGKLRGPPESREWGNALKGCDDPLFLDFLKQCLEWDPAVRMTPGQALRHPWLRRRLPKPPTGEKTSVKRITESTGAITSISKLPPPSSSASKLRTNLAQMTDANGNIQQRTVLPKLVS\",\n          \"MERRASETPEDGDPEEDTATALQRLVELTTSRVTPVRSLRDQYHLIRKLGSGSYGRVLLAQPHQGGPAVALKLLRRDLVLRSTFLREFCVGRCVSAHPGLLQTLAGPLQTPRYFAFAQEYAPCGDLSGMLQERGLPELLVKRVVAQLAGALDFLHSRGLVHADVKPDNVLVFDPVCSRVALGDLGLTRPEGSPTPAPPVPLPTAPPELCLLLPPDTLPLRPAVDSWGLGVLLFCAATACFPWDVALAPNPEFEAFAGWVTTKPQPPQPPPPWDQFAPPALALLQGLLDLDPETRSPPLAVLDFLGDDWGLQGNREGPGVLGSAVSYEDREEGGSSLEEWTDEGDDSKSGGRTGTDGGAP\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 8
        }
      ],
      "source": [
        "# Extract the location from the binding and active site columns\n",
        "def extract_location(site_info):\n",
        "    if pd.isnull(site_info):\n",
        "        return None\n",
        "    locations = []\n",
        "    for info in site_info.split(';'):\n",
        "        if 'BINDING' in info or 'ACT_SITE' in info:\n",
        "            locations.append(info.split()[1])\n",
        "    return '; '.join(locations)\n",
        "\n",
        "# Apply the function to the 'Binding site' and 'Active site' columns to extract the locations\n",
        "data['Binding site'] = data['Binding site'].apply(extract_location)\n",
        "data['Active site'] = data['Active site'].apply(extract_location)\n",
        "\n",
        "# Display the first few rows of the modified dataframe\n",
        "data.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "S4Qz-hMRQx3F",
        "outputId": "25148693-66c1-427c-f6ec-8c91a4e7f845",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "          Entry            Protein families   Binding site Active site  \\\n",
              "359      Q504Y2  Protein kinase superfamily  144..152; 166         278   \n",
              "414      Q8IWB6  Protein kinase superfamily  233..241; 273        None   \n",
              "427      Q8NB16  Protein kinase superfamily  209..217; 230        None   \n",
              "778  A0A7P0T838  Protein kinase superfamily             71        None   \n",
              "779  A0A7P0T952  Protein kinase superfamily             71        None   \n",
              "\n",
              "                                              Sequence  Binding-Active site  \n",
              "359  MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...   144..152; 166; 278  \n",
              "414  MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...  233..241; 273; None  \n",
              "427  MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...  209..217; 230; None  \n",
              "778  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...             71; None  \n",
              "779  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...             71; None  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-7cc7e55b-40be-4416-b1f9-442373789d52\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Entry</th>\n",
              "      <th>Protein families</th>\n",
              "      <th>Binding site</th>\n",
              "      <th>Active site</th>\n",
              "      <th>Sequence</th>\n",
              "      <th>Binding-Active site</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>359</th>\n",
              "      <td>Q504Y2</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>144..152; 166</td>\n",
              "      <td>278</td>\n",
              "      <td>MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...</td>\n",
              "      <td>144..152; 166; 278</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>414</th>\n",
              "      <td>Q8IWB6</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>233..241; 273</td>\n",
              "      <td>None</td>\n",
              "      <td>MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...</td>\n",
              "      <td>233..241; 273; None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>427</th>\n",
              "      <td>Q8NB16</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>209..217; 230</td>\n",
              "      <td>None</td>\n",
              "      <td>MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...</td>\n",
              "      <td>209..217; 230; None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>778</th>\n",
              "      <td>A0A7P0T838</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>71</td>\n",
              "      <td>None</td>\n",
              "      <td>MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...</td>\n",
              "      <td>71; None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>779</th>\n",
              "      <td>A0A7P0T952</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>71</td>\n",
              "      <td>None</td>\n",
              "      <td>MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...</td>\n",
              "      <td>71; None</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "data",
              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 1406,\n  \"fields\": [\n    {\n      \"column\": \"Entry\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1406,\n        \"samples\": [\n          \"D6RBM8\",\n          \"P07333\",\n          \"P80192\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Protein families\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 126,\n        \"samples\": [\n          \"APS kinase family; Sulfate adenylyltransferase family\",\n          \"Protein kinase superfamily, AGC Ser/Thr protein kinase family\",\n          \"Protein kinase superfamily, Ser/Thr protein kinase family, CDC7 subfamily\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Binding site\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 758,\n        \"samples\": [\n          \"85..92; 859\",\n          \"168\",\n          \"208..216; 229\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Active site\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 362,\n        \"samples\": [\n          \"607\",\n          \"147\",\n          \"265\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Sequence\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1361,\n        \"samples\": [\n          
\"MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQGGFGCIYLADMNSSESVGSDAPCVVKVEPSDNGPLFTELKFYQRAAKPEQIQKWIRTRKLKYLGVPKYWGSGLHDKNGKSYRFMIMDRFGSDLQKIYEANAKRFSRKTVLQLSLRILDILEYIHEHEYVHGDIKASNLLLNYKNPDQVYLVDYGLAYRYCPEGVHKEYKEDPKRCHDGTIEFTSIDAHNGVAPSRRGDLEILGYCMIQWLTGHLPWEDNLKDPKYVRDSKIRYRENIASLMDKCFPEKNKPGEIAKYMETVKLLDYTEKPLYENLRDILLQGLKAIGSKDDGKLDLSVVENGGLKAKTITKKRKKEIEESKEPGVEDTEWSNTQTEEAIQTLRSPKEQYIEACLSQRLAAAAMTVQEPERESRSNSDAVNQISFSLFSFDFFLLFYLNCFIFL\",\n          \"MLTRKPSAAAPAAYPTGRGGDSAVRQLQASPGLGAGATRSGVGTGPPSPIALPPLRASNAAAAAHTIGGSKHTMNDHLHVGSHAHGQIQVQQLFEDNSNKRTVLTTQPNGLTTVGKTGLPVVPERQLDSIHRRQGSSTSLKSMEGMGKVKATPMTPEQAMKQYMQKLTAFEHHEIFSYPEIYFLGLNAKKRQGMTGGPNNGGYDDDQGSYVQVPHDHVAYRYEVLKVIGKGSFGQVVKAYDHKVHQHVALKMVRNEKRFHRQAAEEIRILEHLRKQDKDNTMNVIHMLENFTFRNHICMTFELLSMNLYELIKKNKFQGFSLPLVRKFAHSILQCLDALHKNRIIHCDLKPENILLKQQGRSGIKVIDFGSSCYEHQRVYTYIQSRFYRAPEVILGARYGMPIDMWSLGCILAELLTGYPLLPGEDEGDQLACMIELLGMPSQKLLDASKRAKNFVSSKGYPRYCTVTTLSDGSVVLNGGRSRRGKLRGPPESREWGNALKGCDDPLFLDFLKQCLEWDPAVRMTPGQALRHPWLRRRLPKPPTGEKTSVKRITESTGAITSISKLPPPSSSASKLRTNLAQMTDANGNIQQRTVLPKLVS\",\n          \"MERRASETPEDGDPEEDTATALQRLVELTTSRVTPVRSLRDQYHLIRKLGSGSYGRVLLAQPHQGGPAVALKLLRRDLVLRSTFLREFCVGRCVSAHPGLLQTLAGPLQTPRYFAFAQEYAPCGDLSGMLQERGLPELLVKRVVAQLAGALDFLHSRGLVHADVKPDNVLVFDPVCSRVALGDLGLTRPEGSPTPAPPVPLPTAPPELCLLLPPDTLPLRPAVDSWGLGVLLFCAATACFPWDVALAPNPEFEAFAGWVTTKPQPPQPPPPWDQFAPPALALLQGLLDLDPETRSPPLAVLDFLGDDWGLQGNREGPGVLGSAVSYEDREEGGSSLEEWTDEGDDSKSGGRTGTDGGAP\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Binding-Active site\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 843,\n        \"samples\": [\n          \"95..103; 118; 212\",\n          \"505; 615; 628; 610\",\n          \"681..689; 707; 800\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 9
        }
      ],
      "source": [
        "# Create a new column that combines the 'Binding site' and 'Active site' columns.\n",
        "# astype(str) turns a missing value into text, so a row without an active site ends in '; None'.\n",
        "data['Binding-Active site'] = data['Binding site'].astype(str) + '; ' + data['Active site'].astype(str)\n",
        "# Set rows whose combined value is exactly 'nan; nan' to None\n",
        "data['Binding-Active site'] = data['Binding-Active site'].replace('nan; nan', None)\n",
        "\n",
        "data.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cUh_y6FeQx3F"
      },
      "source": [
        "### Angle bracket symbols in Binding/Active site"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fSMsOQIoQx3F"
      },
      "source": [
        "In biological databases like UniProt, entries in the \"Binding site\" or \"Active site\" columns (or any other feature-related column) may contain the symbols '<' or '>'. These typically indicate positional uncertainty, or boundaries that lie outside the range of the sequence currently being annotated:\n",
        "\n",
        "- '<': This symbol is used to indicate that the feature (such as a binding or active site) starts before the position given. For example, if you see \"<5\" in the context of a binding site, it suggests that the binding site starts before amino acid position 5 in the protein sequence.\n",
        "\n",
        "- '>': Conversely, this symbol is used to show that the feature extends beyond the position given. If you see \">200\" for an active site, it implies that the active site extends beyond amino acid position 200.\n",
        "\n",
        "These annotations still locate a functional site within the protein, but they flag that its boundaries are uncertain or incomplete. Common causes are limitations in the experimental data, partial protein sequences, or predictions based on related proteins rather than direct evidence.\n",
        "\n",
        "We will filter out entries containing these symbols so that every remaining binding/active site has an exactly known position."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XdJw2qSEQx3F",
        "outputId": "1ab8ec95-16c6-4d1e-edd3-1d4e4a9ca1c2",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Number of entries with angle brackets: 0\n",
            "Number of remaining rows: 1406\n"
          ]
        }
      ],
      "source": [
        "# Find entries containing '<' or '>'\n",
        "entries_angles = data['Binding-Active site'].str.contains('<|>', na=False)\n",
        "print(f\"Number of entries with angle brackets: {entries_angles.sum()}\")\n",
        "\n",
        "# Remove all rows where the \"Binding-Active site\" column contains '<' or '>'\n",
        "data = data[~entries_angles]\n",
        "print(f\"Number of remaining rows: {data.shape[0]}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "inJN9W9_Qx3F"
      },
      "source": [
        "### Question mark (\"?\") symbols in Binding/Active site"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "O1C_md95Qx3F"
      },
      "source": [
        "In biological databases like UniProt, a question mark (\"?\") in the \"Binding site\" or \"Active site\" columns typically indicates uncertain or incomplete information about the feature. The exact position of the binding or active site within the protein sequence may not have been determined, or the feature may have been predicted by computational models or inferred from homologous proteins without experimental verification. It can also reflect conflicting data or interpretations about the presence or characteristics of the site, or simply an incomplete annotation. As with the angle brackets, we remove these entries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GcNA0PYxQx3F",
        "outputId": "0a342079-3bf3-43f3-94e9-d4f0844ceb98",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Number of entries with question marks: 0\n",
            "Number of remaining rows: 1406\n"
          ]
        }
      ],
      "source": [
        "# Find rows where the \"Binding-Active site\" column contains the character \"?\", treating \"?\" as a literal character\n",
        "entries_question_mark = data[data['Binding-Active site'].str.contains('?', na=False, regex=False)]\n",
        "print(f\"Number of entries with question marks: {entries_question_mark.shape[0]}\")\n",
        "\n",
        "# Remove all rows containing '?' in the \"Binding-Active site\" column\n",
        "data = data.drop(entries_question_mark.index)\n",
        "print(f\"Number of remaining rows: {data.shape[0]}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Wcgmb6XPQx3F"
      },
      "source": [
        "### Binding/active site labels"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ogYkvIHCQx3F"
      },
      "source": [
        "Now let's identify all amino acids involved in binding/active sites by expanding the ranges, so that every amino acid index that is part of a binding/active site is listed explicitly:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NdXC0E0JQx3F",
        "outputId": "38c6e7b6-1e1c-4163-f9b5-9c68b8fb38cc",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "          Entry            Protein families   Binding site Active site  \\\n",
            "359      Q504Y2  Protein kinase superfamily  144..152; 166         278   \n",
            "414      Q8IWB6  Protein kinase superfamily  233..241; 273        None   \n",
            "427      Q8NB16  Protein kinase superfamily  209..217; 230        None   \n",
            "778  A0A7P0T838  Protein kinase superfamily             71        None   \n",
            "779  A0A7P0T952  Protein kinase superfamily             71        None   \n",
            "\n",
            "                                              Sequence  \\\n",
            "359  MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...   \n",
            "414  MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...   \n",
            "427  MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...   \n",
            "778  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...   \n",
            "779  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...   \n",
            "\n",
            "                                   Binding-Active site  \n",
            "359  144, 145, 146, 147, 148, 149, 150, 151, 152; 1...  \n",
            "414  233, 234, 235, 236, 237, 238, 239, 240, 241; 2...  \n",
            "427  209, 210, 211, 212, 213, 214, 215, 216, 217; 2...  \n",
            "778                                           71; None  \n",
            "779                                           71; None  \n"
          ]
        }
      ],
      "source": [
        "def expand_ranges(s):\n",
        "    \"\"\"Expand each 'start..end' range into a comma-separated list of positions (both ends inclusive).\"\"\"\n",
        "    return re.sub(r'(\\d+)\\.\\.(\\d+)', lambda m: ', '.join(map(str, range(int(m.group(1)), int(m.group(2))+1))), str(s))\n",
        "\n",
        "data['Binding-Active site'] = data['Binding-Active site'].apply(expand_ranges)\n",
        "print(data.head())"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "You can now convert the binding/active site information into a binary label: 1s where there is a binding/active site, 0s where there is not. Retrieve the indices in the 'Binding-Active site' column and set their corresponding positions in the protein sequence to 1. All other amino acids of the sequence are set to 0:"
      ],
      "metadata": {
        "id": "70UzFrlGWxYY"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QNymcWJTQx3F",
        "outputId": "15e8c560-9fe2-47e7-cdf9-98c83e0acfd1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "          Entry            Protein families   Binding site Active site  \\\n",
              "359      Q504Y2  Protein kinase superfamily  144..152; 166         278   \n",
              "414      Q8IWB6  Protein kinase superfamily  233..241; 273        None   \n",
              "427      Q8NB16  Protein kinase superfamily  209..217; 230        None   \n",
              "778  A0A7P0T838  Protein kinase superfamily             71        None   \n",
              "779  A0A7P0T952  Protein kinase superfamily             71        None   \n",
              "\n",
              "                                              Sequence  \\\n",
              "359  MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...   \n",
              "414  MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...   \n",
              "427  MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...   \n",
              "778  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...   \n",
              "779  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...   \n",
              "\n",
              "                                   Binding-Active site  \n",
              "359  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  \n",
              "414  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  \n",
              "427  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  \n",
              "778  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  \n",
              "779  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-7ea5b4ae-0460-4711-b3ed-56b171c61c76\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Entry</th>\n",
              "      <th>Protein families</th>\n",
              "      <th>Binding site</th>\n",
              "      <th>Active site</th>\n",
              "      <th>Sequence</th>\n",
              "      <th>Binding-Active site</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>359</th>\n",
              "      <td>Q504Y2</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>144..152; 166</td>\n",
              "      <td>278</td>\n",
              "      <td>MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>414</th>\n",
              "      <td>Q8IWB6</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>233..241; 273</td>\n",
              "      <td>None</td>\n",
              "      <td>MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>427</th>\n",
              "      <td>Q8NB16</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>209..217; 230</td>\n",
              "      <td>None</td>\n",
              "      <td>MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>778</th>\n",
              "      <td>A0A7P0T838</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>71</td>\n",
              "      <td>None</td>\n",
              "      <td>MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>779</th>\n",
              "      <td>A0A7P0T952</td>\n",
              "      <td>Protein kinase superfamily</td>\n",
              "      <td>71</td>\n",
              "      <td>None</td>\n",
              "      <td>MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-7ea5b4ae-0460-4711-b3ed-56b171c61c76')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-7ea5b4ae-0460-4711-b3ed-56b171c61c76 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-7ea5b4ae-0460-4711-b3ed-56b171c61c76');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-b15a51dd-6168-41b0-ae00-d9341d809622\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-b15a51dd-6168-41b0-ae00-d9341d809622')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-b15a51dd-6168-41b0-ae00-d9341d809622 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "data",
              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 1406,\n  \"fields\": [\n    {\n      \"column\": \"Entry\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1406,\n        \"samples\": [\n          \"D6RBM8\",\n          \"P07333\",\n          \"P80192\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Protein families\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 126,\n        \"samples\": [\n          \"APS kinase family; Sulfate adenylyltransferase family\",\n          \"Protein kinase superfamily, AGC Ser/Thr protein kinase family\",\n          \"Protein kinase superfamily, Ser/Thr protein kinase family, CDC7 subfamily\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Binding site\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 758,\n        \"samples\": [\n          \"85..92; 859\",\n          \"168\",\n          \"208..216; 229\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Active site\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 362,\n        \"samples\": [\n          \"607\",\n          \"147\",\n          \"265\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Sequence\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1361,\n        \"samples\": [\n          
\"MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQGGFGCIYLADMNSSESVGSDAPCVVKVEPSDNGPLFTELKFYQRAAKPEQIQKWIRTRKLKYLGVPKYWGSGLHDKNGKSYRFMIMDRFGSDLQKIYEANAKRFSRKTVLQLSLRILDILEYIHEHEYVHGDIKASNLLLNYKNPDQVYLVDYGLAYRYCPEGVHKEYKEDPKRCHDGTIEFTSIDAHNGVAPSRRGDLEILGYCMIQWLTGHLPWEDNLKDPKYVRDSKIRYRENIASLMDKCFPEKNKPGEIAKYMETVKLLDYTEKPLYENLRDILLQGLKAIGSKDDGKLDLSVVENGGLKAKTITKKRKKEIEESKEPGVEDTEWSNTQTEEAIQTLRSPKEQYIEACLSQRLAAAAMTVQEPERESRSNSDAVNQISFSLFSFDFFLLFYLNCFIFL\",\n          \"MLTRKPSAAAPAAYPTGRGGDSAVRQLQASPGLGAGATRSGVGTGPPSPIALPPLRASNAAAAAHTIGGSKHTMNDHLHVGSHAHGQIQVQQLFEDNSNKRTVLTTQPNGLTTVGKTGLPVVPERQLDSIHRRQGSSTSLKSMEGMGKVKATPMTPEQAMKQYMQKLTAFEHHEIFSYPEIYFLGLNAKKRQGMTGGPNNGGYDDDQGSYVQVPHDHVAYRYEVLKVIGKGSFGQVVKAYDHKVHQHVALKMVRNEKRFHRQAAEEIRILEHLRKQDKDNTMNVIHMLENFTFRNHICMTFELLSMNLYELIKKNKFQGFSLPLVRKFAHSILQCLDALHKNRIIHCDLKPENILLKQQGRSGIKVIDFGSSCYEHQRVYTYIQSRFYRAPEVILGARYGMPIDMWSLGCILAELLTGYPLLPGEDEGDQLACMIELLGMPSQKLLDASKRAKNFVSSKGYPRYCTVTTLSDGSVVLNGGRSRRGKLRGPPESREWGNALKGCDDPLFLDFLKQCLEWDPAVRMTPGQALRHPWLRRRLPKPPTGEKTSVKRITESTGAITSISKLPPPSSSASKLRTNLAQMTDANGNIQQRTVLPKLVS\",\n          \"MERRASETPEDGDPEEDTATALQRLVELTTSRVTPVRSLRDQYHLIRKLGSGSYGRVLLAQPHQGGPAVALKLLRRDLVLRSTFLREFCVGRCVSAHPGLLQTLAGPLQTPRYFAFAQEYAPCGDLSGMLQERGLPELLVKRVVAQLAGALDFLHSRGLVHADVKPDNVLVFDPVCSRVALGDLGLTRPEGSPTPAPPVPLPTAPPELCLLLPPDTLPLRPAVDSWGLGVLLFCAATACFPWDVALAPNPEFEAFAGWVTTKPQPPQPPPPWDQFAPPALALLQGLLDLDPETRSPPLAVLDFLGDDWGLQGNREGPGVLGSAVSYEDREEGGSSLEEWTDEGDDSKSGGRTGTDGGAP\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Binding-Active site\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 13
        }
      ],
      "source": [
        "def convert_to_binary_list(binding_active_str, sequence_len):\n",
        "    \"\"\"Convert a Binding-Active site string to a binary list based on the sequence length.\"\"\"\n",
        "    binary_list = [0] * sequence_len\n",
        "    # Retrieve the 1-based indices of the binding/active sites and set the corresponding 0-based positions to 1\n",
        "    if pd.notna(binding_active_str):\n",
        "        indices = [int(x) - 1 for segment in binding_active_str.split(';') for x in segment.split(',') if x.strip().isdigit()]\n",
        "        for idx in indices:\n",
        "            if 0 <= idx < sequence_len: # Ensure the index is within the valid range\n",
        "                binary_list[idx] = 1\n",
        "\n",
        "    return binary_list\n",
        "\n",
        "# Apply the function to every row of the dataset\n",
        "data['Binding-Active site'] = data.apply(lambda row: convert_to_binary_list(row['Binding-Active site'], len(row['Sequence'])), axis=1)\n",
        "data.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XEQJB-4hQx3G"
      },
      "source": [
        "## Split train/test sets"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's split the data into training and test sets by UniProt protein family, so that each family ends up entirely in the training set or entirely in the test set. The test set then contains only families that were never seen during training, so the evaluation measures the model's ability to generalize to entirely new protein families.\n",
        "\n",
        "Notably, this is different from the traditional stratified split, which aims to preserve the distribution of classes across both sets."
      ],
      "metadata": {
        "id": "dmT0x6dcr9c6"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qZmJyW55Qx3G",
        "outputId": "a3cd21d2-5fa4-4dd9-827f-67be06275248",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Number of distinct protein families: 126\n"
          ]
        }
      ],
      "source": [
        "# Get the number of distinct protein families\n",
        "num_families = data['Protein families'].nunique()\n",
        "print(f\"Number of distinct protein families: {num_families}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ovg9UtLFQx3G",
        "outputId": "fa49b102-5712-47ce-bd73-5380782e05ae",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "392 1014\n"
          ]
        }
      ],
      "source": [
        "def split_data_by_family(data, test_ratio=0.20):\n",
        "    \"\"\"\n",
        "    Splits the dataset into train and test sets by entire protein families (not a family-stratified split!).\n",
        "\n",
        "    Parameters:\n",
        "    - data: pandas DataFrame containing the dataset with a 'Protein families' column.\n",
        "    - test_ratio: float, the target proportion of rows to include in the test split (the last family added may push the test set past it).\n",
        "\n",
        "    Returns:\n",
        "    - test_df: pandas DataFrame containing the test set.\n",
        "    - train_df: pandas DataFrame containing the training set.\n",
        "    \"\"\"\n",
        "    # Get unique protein families and shuffle them to randomize the selection\n",
        "    unique_families = data['Protein families'].unique()\n",
        "    np.random.shuffle(unique_families)\n",
        "\n",
        "    # Loop through the shuffled families and add rows to the test set\n",
        "    test_rows = []\n",
        "    current_test_rows = 0\n",
        "    for family in unique_families:\n",
        "        family_rows = data[data['Protein families'] == family].index.tolist()\n",
        "        if current_test_rows + len(family_rows) <= int(test_ratio * data.shape[0]):\n",
        "            test_rows.extend(family_rows)\n",
        "            current_test_rows += len(family_rows)\n",
        "        else:\n",
        "            # This family would push the test set past the target size: add it anyway and stop, so the test set can exceed test_ratio\n",
        "            test_rows.extend(family_rows)\n",
        "            break\n",
        "\n",
        "    # Create the test and train datasets\n",
        "    train_rows = [i for i in data.index if i not in test_rows]\n",
        "    test_df = data.loc[test_rows]\n",
        "    train_df = data.loc[train_rows]\n",
        "\n",
        "    return test_df, train_df\n",
        "\n",
        "test_df, train_df = split_data_by_family(data, test_ratio=0.20)\n",
        "print(test_df.shape[0], train_df.shape[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wkQrKwAFQx3G",
        "outputId": "0878620e-1c6f-4122-f53f-23407d53f170",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "         Entry                                   Protein families  \\\n",
              "39      O43252  APS kinase family; Sulfate adenylyltransferase...   \n",
              "68      O95340  APS kinase family; Sulfate adenylyltransferase...   \n",
              "4   A0A2P9DU05  Protein kinase superfamily, AGC Ser/Thr protei...   \n",
              "12      O00141  Protein kinase superfamily, AGC Ser/Thr protei...   \n",
              "22      O14578  Protein kinase superfamily, AGC Ser/Thr protei...   \n",
              "\n",
              "                                         Binding site Active site  \\\n",
              "39  62..67; 89..92; 101; 106..109; 132..133; 171; ...        None   \n",
              "68  52..57; 79..82; 91; 96..99; 122..123; 161; 174...        None   \n",
              "4                                                 121         214   \n",
              "12                                      104..112; 127         222   \n",
              "22                                      103..111; 126         221   \n",
              "\n",
              "                                             Sequence  \\\n",
              "39  MEIPGSLCKKVKLSNNAQNWGMQRATNVTYQAHHVSRNKRGQVVGT...   \n",
              "68  MSGIKKQKTENQQKSTNVVYQAHHVSRNKRGQVVGTRGGFRGCTVW...   \n",
              "4   MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...   \n",
              "12  MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS...   \n",
              "22  MLKFKYGARNPLDAGAAEPIASRASRLNLFFQGKPPFMTQQQMSPL...   \n",
              "\n",
              "                                  Binding-Active site  \n",
              "39  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  \n",
              "68  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  \n",
              "4   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  \n",
              "12  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  \n",
              "22  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-7ed403a7-3485-4b7d-8f5c-9b5804be8e65\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Entry</th>\n",
              "      <th>Protein families</th>\n",
              "      <th>Binding site</th>\n",
              "      <th>Active site</th>\n",
              "      <th>Sequence</th>\n",
              "      <th>Binding-Active site</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>39</th>\n",
              "      <td>O43252</td>\n",
              "      <td>APS kinase family; Sulfate adenylyltransferase...</td>\n",
              "      <td>62..67; 89..92; 101; 106..109; 132..133; 171; ...</td>\n",
              "      <td>None</td>\n",
              "      <td>MEIPGSLCKKVKLSNNAQNWGMQRATNVTYQAHHVSRNKRGQVVGT...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>68</th>\n",
              "      <td>O95340</td>\n",
              "      <td>APS kinase family; Sulfate adenylyltransferase...</td>\n",
              "      <td>52..57; 79..82; 91; 96..99; 122..123; 161; 174...</td>\n",
              "      <td>None</td>\n",
              "      <td>MSGIKKQKTENQQKSTNVVYQAHHVSRNKRGQVVGTRGGFRGCTVW...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>A0A2P9DU05</td>\n",
              "      <td>Protein kinase superfamily, AGC Ser/Thr protei...</td>\n",
              "      <td>121</td>\n",
              "      <td>214</td>\n",
              "      <td>MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>12</th>\n",
              "      <td>O00141</td>\n",
              "      <td>Protein kinase superfamily, AGC Ser/Thr protei...</td>\n",
              "      <td>104..112; 127</td>\n",
              "      <td>222</td>\n",
              "      <td>MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>22</th>\n",
              "      <td>O14578</td>\n",
              "      <td>Protein kinase superfamily, AGC Ser/Thr protei...</td>\n",
              "      <td>103..111; 126</td>\n",
              "      <td>221</td>\n",
              "      <td>MLKFKYGARNPLDAGAAEPIASRASRLNLFFQGKPPFMTQQQMSPL...</td>\n",
              "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-7ed403a7-3485-4b7d-8f5c-9b5804be8e65')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-7ed403a7-3485-4b7d-8f5c-9b5804be8e65 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-7ed403a7-3485-4b7d-8f5c-9b5804be8e65');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-50f317fa-87c1-46a3-be88-3479ebb0f48d\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-50f317fa-87c1-46a3-be88-3479ebb0f48d')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-50f317fa-87c1-46a3-be88-3479ebb0f48d button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "test_df",
              "summary": "{\n  \"name\": \"test_df\",\n  \"rows\": 392,\n  \"fields\": [\n    {\n      \"column\": \"Entry\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 392,\n        \"samples\": [\n          \"Q9H1R3\",\n          \"A0A7P0TAS8\",\n          \"Q5TBH2\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Protein families\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 24,\n        \"samples\": [\n          \"Protein kinase superfamily, Ser/Thr protein kinase family, Haspin subfamily\",\n          \"Protein kinase superfamily, Tyr protein kinase family, Tie subfamily\",\n          \"APS kinase family; Sulfate adenylyltransferase family\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Binding site\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 220,\n        \"samples\": [\n          \"108\",\n          \"565..573; 588\",\n          \"89; 189..191; 189\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Active site\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 104,\n        \"samples\": [\n          \"218\",\n          \"786\",\n          \"1994\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Sequence\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 381,\n        \"samples\": [\n          
\"MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQGGFGCIYLADMNSSESVGSDAPCVVKVEPSDNGPLFTELKFYQRAAKPEQIQKWIRTRKLKYLGVPKYWGSGLHDKNGKSYRFMIMDRFGSDLQKIYEANAKRFSRKTVLQLSLRILDILEYIHEHEYVHGDIKASNLLLNYKNPDQVYLVDYGLAYRYCPEGVHKEYKEDPKRCHDGTIEFTSIDAHNGVAPSRRGDLEILGYCMIQWLTGHLPWEDNLKDPKYVRDSKIRYRENIASLMDKCFPEKNKPGEIAKYMETVKLLDYTEKPLYENLRDILLQGLKAIGSKDDGKLDLSVVENGGLKAKTITKKRKKEIEESKEPGVEDTEWSNTQTEEAIQTRGPHERCGVLITGLPGDQNCLLNAIMT\",\n          \"MSSGTMKFNGYLRVRIGEAVGLQPTRWSPRHSLFKKGHQLLDPYLTVSVDQVRVGQTSTKQKTNKPTYNEEFCANVTDGGHLELAVFHETPLGYDHFVANCTLQFQELLRTTGASDTFEGWVDLEPEGKVFVVITLTGSFTEATLQRDRIFKHFTRKRQRAMRRRVHQINGHKFMATYLRQPTYCSHCREFIWGVFGKQGYQCQVCTCVVHKRCHHLIVTACTCQNNINKVDSKIAEQRFGINIPHKFSIHNYKVPTFCDHCGSLLWGIMRQGLQCKICKMNVHIRCQANVAPNCGVNAVELAKTLAGMGLQPGNISPTSKLVSRSTLRRQGKESSKEGNGIGVNSSNRLGIDNFEFIRVLGKGSFGKVMLARVKETGDLYAVKVLKKDVILQDDDVECTMTEKRILSLARNHPFLTQLFCCFQTPDRLFFVMEFVNGGDLMFHIQKSRRFDEARARFYAAEIISALMFLHDKGIIYRDLKLDNVLLDHEGHCKLADFGMCKEGICNGVTTATFCGTPDYIAPEILQEMLYGPAVDWWAMGVLLYEMLCGHAPFEAENEDDLFEAILNDEVVYPTWLHEDATGILKSFMTKNPTMRLGSLTQGGEHAILRHPFFKEIDWAQLNHRQIEPPFRPRIKSREDVSNFDPDFIKEEPVLTPIDEGHLPMINQDEFRNFSYVSPELQP\",\n          \"MDYSDTDSDATLGYSDDEDSSDEVQRISEEDVRTANVIAAEAVTCLVIDRDSFKHLIGGLDDVSNKAYEDAEAKAKYEAEAAFFANLKLSDFNIIDTLGVGGFGRVELVQLKSEESKTFAMKILKKRHIVDTRQQEHIRSEKQIMQGAHSDFIVRLYRTFKDSKYLYMLMEACLGGELWTILRDRGSFEDSTTRFYTACVVEAFAYLHSKGIIYRDLKPENLILDHRGYAKLVDFGFAKKIGFGKKTWTFCGTPEYVAPEIILNKGHDISADYWSLGILMYELLTGSPPFSGPDPMKTYNIILRGIDMIEFPKKIAKNAANLIKKLCRWFEGFNWEGLRKGTLTPPIIPSVASPTDTSNFDSFPEDNDEPPPDDNSGWDIDF\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"Binding-Active site\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 16
        }
      ],
      "source": [
        "test_df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "O8SesehiQx3G"
      },
      "source": [
        "In case you don't want to keep the entire train/test datasets, you can create a smaller version (with a random representation of the original dataset). Uncomment the code below if that is the case:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iVQQHCogQx3G"
      },
      "outputs": [],
      "source": [
        "# # Percentage of data you want to keep\n",
        "# k = 0.05  # for keeping 5% of the data\n",
        "\n",
        "# # Generate random indices representing a percentage of each dataset\n",
        "# train_df = train_df.sample(frac=k, random_state=42)\n",
        "# test_df = test_df.sample(frac=k, random_state=42)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JFUI4dh9Qx3G"
      },
      "source": [
        "## Split sequences into chunks"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gfzzgqydQx3G"
      },
      "source": [
        "Sequences aren’t always of the same length. We will split the longer protein sequences and their lables into non-overlapping chunks of certain length or less to account for a given context window of ESM-2 models. Most protein sequences are on average 350 or so residues, so having longer context windows is often unnecessary, but keep in mind this will effect training time and batch size. Here, we pick a context of 1000."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "laMNBye0Qx3G"
      },
      "outputs": [],
      "source": [
        "def split_into_chunks(sequences, labels, chunk_size = 1000):\n",
        "    \"\"\"Split sequences and labels into chunks of size \"chunk_size\" or less.\"\"\"\n",
        "    new_sequences = []\n",
        "    new_labels = []\n",
        "    for seq, lbl in zip(sequences, labels):\n",
        "        if len(seq) > chunk_size:\n",
        "            # Split the sequence and labels into chunks of size \"chunk_size\" or less\n",
        "            for i in range(0, len(seq), chunk_size):\n",
        "                new_sequences.append(seq[i:i+chunk_size])\n",
        "                new_labels.append(lbl[i:i+chunk_size])\n",
        "        else:\n",
        "            new_sequences.append(seq)\n",
        "            new_labels.append(lbl)\n",
        "\n",
        "    return new_sequences, new_labels\n"
      ]
    },
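    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see what the chunking does, here is a minimal, self-contained sketch on a toy example. The helper `chunk_pairs` is a hypothetical stand-in that uses the same slicing logic as `split_into_chunks` above, and the 25-residue sequence and label positions are made up for illustration.\n",
        "\n",
        "```python\n",
        "def chunk_pairs(seq, lbl, chunk_size):\n",
        "    # Hypothetical helper: slice a sequence and its labels at the same offsets\n",
        "    return [(seq[i:i + chunk_size], lbl[i:i + chunk_size])\n",
        "            for i in range(0, len(seq), chunk_size)]\n",
        "\n",
        "toy_seq = 'MKTAYIAKQRQISFVKSHFSRQLEE'  # 25 made-up residues\n",
        "toy_lbl = [0] * len(toy_seq)\n",
        "toy_lbl[3] = 1   # pretend the 4th residue is a binding site\n",
        "toy_lbl[12] = 1  # pretend the 13th residue is an active site\n",
        "\n",
        "toy_chunks = chunk_pairs(toy_seq, toy_lbl, chunk_size=10)\n",
        "for chunk_seq, chunk_lbl in toy_chunks:\n",
        "    print(len(chunk_seq), chunk_seq, chunk_lbl)\n",
        "```\n",
        "\n",
        "The last chunk is shorter than `chunk_size`, and each label keeps its position relative to the start of its own chunk."
      ]
    },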
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "govSX7K0Qx3G"
      },
      "outputs": [],
      "source": [
        "# Create lists of sequences and labels\n",
        "test_seq = test_df['Sequence'].tolist()\n",
        "test_labels = test_df['Binding-Active site'].tolist()\n",
        "train_seq = train_df['Sequence'].tolist()\n",
        "train_labels = train_df['Binding-Active site'].tolist()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LSTGHPNHQx3G"
      },
      "outputs": [],
      "source": [
        "# Apply the function to create new datasets with chunks of size \"chunk_size\" or less\n",
        "chunk_size = 1000\n",
        "test_seq_chunked, test_labels_chunked = split_into_chunks(test_seq, test_labels)\n",
        "train_seq_chunked, train_labels_chunked = split_into_chunks(train_seq, train_labels)"
      ]
    },
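    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before saving, it can be worth checking that chunking kept sequences and labels aligned. The sketch below shows one possible check. `check_chunks` is a hypothetical helper (not part of DeepChem), and the toy inputs stand in for lists such as `train_seq_chunked` and `train_labels_chunked`.\n",
        "\n",
        "```python\n",
        "def check_chunks(seq_chunks, label_chunks, chunk_size):\n",
        "    # Hypothetical helper: True if every chunk has labels of the same length\n",
        "    # and no chunk is longer than chunk_size\n",
        "    if len(seq_chunks) != len(label_chunks):\n",
        "        return False\n",
        "    return all(len(seq) == len(lab) <= chunk_size\n",
        "               for seq, lab in zip(seq_chunks, label_chunks))\n",
        "\n",
        "# Made-up chunks for illustration\n",
        "toy_seq_chunks = ['MKTAY', 'IAK']\n",
        "toy_label_chunks = [[0, 1, 0, 0, 0], [0, 0, 1]]\n",
        "\n",
        "ok = check_chunks(toy_seq_chunks, toy_label_chunks, chunk_size=5)\n",
        "bad = check_chunks(['MKTAY'], [[0, 1]], chunk_size=5)  # labels too short\n",
        "print(ok, bad)\n",
        "```\n",
        "\n",
        "A failed check usually points to a label list whose length did not match its sequence before chunking."
      ]
    },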
    {
      "cell_type": "markdown",
      "source": [
        "The resulting train and test files will be exported to the same path where the input data file was located:"
      ],
      "metadata": {
        "id": "UlIKcBmmVkOq"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ll3LvDFJQx3G",
        "outputId": "4faac7d5-cd56-4937-95ec-eb050100a33b",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "('/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_test_labels_chunked_1000.pkl',\n",
              " '/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_test_sequences_chunked_1000.pkl',\n",
              " '/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_train_labels_chunked_1000.pkl',\n",
              " '/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_train_sequences_chunked_1000.pkl')"
            ]
          },
          "metadata": {},
          "execution_count": 21
        }
      ],
      "source": [
        "filename = os.path.splitext(os.path.basename(file_path))[0]\n",
        "dir = os.path.dirname(file_path)\n",
        "\n",
        "# Paths to save the new chunked pickle files\n",
        "test_labels_path =  os.path.join(dir, filename + \"_test_labels_chunked_\" + str(chunk_size) + \".pkl\")\n",
        "test_seq_path = os.path.join(dir, filename + \"_test_sequences_chunked_\" + str(chunk_size) + \".pkl\")\n",
        "train_labels_path = os.path.join(dir, filename + \"_train_labels_chunked_\" + str(chunk_size) + \".pkl\")\n",
        "train_seq_path = os.path.join(dir, filename + \"_train_sequences_chunked_\" + str(chunk_size) + \".pkl\")\n",
        "\n",
        "# Save the chunked datasets as new pickle files\n",
        "with open(test_labels_path, 'wb') as file:\n",
        "    pickle.dump(test_labels_chunked, file)\n",
        "with open(test_seq_path, 'wb') as file:\n",
        "    pickle.dump(test_seq_chunked, file)\n",
        "with open(train_labels_path, 'wb') as file:\n",
        "    pickle.dump(train_labels_chunked, file)\n",
        "with open(train_seq_path, 'wb') as file:\n",
        "    pickle.dump(train_seq_chunked, file)\n",
        "\n",
        "test_labels_path, test_seq_path, train_labels_path, train_seq_path\n"
      ]
    },
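    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To reuse the chunked data later, the pickle files can be read back with `pickle.load`. Here is a minimal round-trip sketch that uses a temporary directory and made-up data in place of the paths above; the file name `toy_labels_chunked.pkl` is hypothetical.\n",
        "\n",
        "```python\n",
        "import os\n",
        "import pickle\n",
        "import tempfile\n",
        "\n",
        "toy_labels = [[0, 1, 0], [1, 0]]  # made-up chunked labels\n",
        "\n",
        "with tempfile.TemporaryDirectory() as tmp_dir:\n",
        "    toy_path = os.path.join(tmp_dir, 'toy_labels_chunked.pkl')\n",
        "    # Write the list to disk, then read it back\n",
        "    with open(toy_path, 'wb') as f:\n",
        "        pickle.dump(toy_labels, f)\n",
        "    with open(toy_path, 'rb') as f:\n",
        "        loaded_labels = pickle.load(f)\n",
        "\n",
        "print(loaded_labels == toy_labels)\n",
        "```\n",
        "\n",
        "Only unpickle files that you created yourself or that come from a source you trust, since loading a pickle can execute arbitrary code."
      ]
    },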
    {
      "cell_type": "markdown",
      "source": [
        "# Congratulations! Time to join the Community!\n",
        "Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:\n"
      ],
      "metadata": {
        "id": "LHxUMMWdFsNj"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)\n",
        "This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.\n"
      ],
      "metadata": {
        "id": "L96MXVKuFdTK"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Join the DeepChem Discord\n",
        "The DeepChem [Discord](https://discord.gg/cGzwCdrUqS) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!"
      ],
      "metadata": {
        "id": "1YX9impOFUeI"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Citing this tutorial\n",
        "If you found this tutorial useful please consider citing it using the provided BibTeX.\n",
        "\n"
      ],
      "metadata": {
        "id": "yMc5vOprV_LO"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "```\n",
        "@manual{Bioinformatics,\n",
        " title={UniProt data pre-processing for binding site prediction downstream task},\n",
        " organization={DeepChem},\n",
        " author={Gómez de Lope, Elisa},\n",
        " howpublished = {\\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/UniProt_Data_Preprocessing_for_Binding_Sites.ipynb}},\n",
        " year={2024},\n",
        "}\n",
        "```\n",
        "\n"
      ],
      "metadata": {
        "id": "NmuadXaIWcmg"
      }
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}